System and method to estimate end-to-end video frame delays

ABSTRACT

System and method to calculate video frame delay in a video stream received by a telecommunications endpoint, the method including: locating reference features characteristic of content in the received video stream; calculating, by use of a processor, reduced reference features from the located reference features; receiving reduced reference features of a transmitted video stream, the transmitted video stream corresponding to the received video stream; calculating, by use of a processor, a received trajectory of the reduced reference features from the received video stream; calculating, by use of a processor, a transmitted trajectory of the reduced reference features from the transmitted video stream; and calculating, by use of a processor, video frame delay as a time shift between the received trajectory and the transmitted trajectory.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 13/706,383, filed on Dec. 6, 2012, the content of which is hereby incorporated by reference in its entirety.

BACKGROUND

1. Field of the Invention

Embodiments of the present invention generally relate to detection of degraded quality of a video transmission, and, in particular, to a system and method for using face detection to detect and correct for excessive end-to-end video frame delays in order to improve video quality.

2. Description of Related Art

Often during a live interview on TV, an interviewer in a studio may be talking with an interviewee at a remote location. There may be an appreciable delay for the video and audio signal going back and forth. This tends to create ambiguous verbal cues as to whether one person has stopped talking and is expecting the other person to start talking, and so forth. As a result, the interviewer and interviewee may begin talking over one another, and then both stop and wait for the other person to continue talking, and so forth. This scenario is one manifestation of excessive end-to-end video frame delays. In another manifestation, there may be a relatively high differential delay between the audio and video portions of an interview, such that there is a noticeable and annoying timing mismatch between spoken words that are heard and video of the speaker speaking those words.

Improving and maintaining high video quality during adverse network conditions is important for wide deployments of video over IP networks that inherently lack end-to-end quality of service (“QoS”) guarantees. Application-layer quality assurance is typically enhanced by monitoring video frame delay in real-time, detecting degradation, and taking appropriate action when the video frame delay increases unacceptably. A key step in the process, detection of high video frame delay in real-time, requires light-weight video metrics that can be computed with low computational overheads and communicated to the sending side with small transmission overheads.

End-to-end frame delay is an important metric impacting video Quality of Experience (“QoE”). End-to-end frame delay is defined as the difference between the time of capture of a frame at the source and the time the frame is displayed at the destination. High frame delays can render a video conversation useless and can contribute to lip-synching problems. As the audio and video streams in video conferencing and video phones typically take different paths, understanding the end-to-end frame delay is important in QoE monitoring and, potentially, in debugging.

When a video system is operational, frame delays can be computed by inserting a watermark in parts of the image not visible to the user. Watermarking involves embedding timing information into video stream images such that the embedded timing information can be used to identify matching frames between the sent and received streams. Frames with the same watermark values on the sent and received sides of a video stream are determined, and their timing information is compared to compute the end-to-end frame delay. The clocks of the machines computing the metric need to be synchronized. A disadvantage is that watermarks may become distorted or obliterated during transcoding operations.
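
By way of a non-limiting illustration of this watermark approach (which is background art, not the inventive method), the following Python sketch computes per-frame delays from hypothetical logs that map watermark values to timestamps on each side. The function name and data layout are illustrative assumptions, and synchronized clocks are presumed, as stated above.

```python
# Hedged sketch: end-to-end frame delay from watermark logs.
# Assumes synchronized clocks and that each side records
# watermark value -> timestamp (seconds); names are hypothetical.

def watermark_delays(sent_log, received_log):
    """Return per-frame delays for watermark values seen on both sides."""
    delays = {}
    for mark, t_sent in sent_log.items():
        t_recv = received_log.get(mark)
        if t_recv is not None:
            delays[mark] = t_recv - t_sent
    return delays

# A frame tagged 42, captured at 10.000 s and displayed at 10.180 s,
# yields a 180 ms end-to-end frame delay.
print(watermark_delays({42: 10.000}, {42: 10.180}))  # {42: 0.18}
```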

Frame delay may also be computed by synchronizing sent and received frames, and then using the timing information of the synchronized frames to compute the frame delay. Frame synchronization typically relies on image-processing-based techniques and is time consuming, especially in the presence of frame losses, transcoding, and changes in frame rates and resolution. Hence, computing frame delay by relying on frame synchronization is not suitable for real-time operations.

End-to-end frame delays, while important for QoE, are typically not measured or reported to the users during a video conference or a video call. Therefore, a need exists for a process to measure delays between a sent and received video stream, in order to provide end-to-end frame delay measurements and, ultimately, improved customer satisfaction.

SUMMARY OF THE INVENTION

Embodiments in accordance with the present invention address estimating end-to-end frame delay for video streams subjected to transcoding/mixing in video conferencing applications. The technique is computationally lightweight and agnostic to the video decoder, frame size, frame rate, and bit rate used.

Embodiments of the present invention generally relate to video impairments, and, in particular, to a system and method for using face detection in estimating frame delay, thereby exploiting characteristics of video content in applications such as video conferencing, which often involve relatively few speakers and a relatively low amount of motion. In such applications, motion is concentrated mainly around the face, making the face an area of interest. Loss of synchronization in facial regions is more likely to be noticed by viewers. Embodiments in accordance with the present invention use a novel frame delay estimation technique that focuses on a box surrounding the faces. Embodiments in accordance with the present invention may measure how trajectories of a characteristic of the box differ between sent and received frames under network degradation.

The difference in a characteristic of the box between sent and received frames is a lightweight indicator (i.e., an indicator that is not resource intensive to compute), in contrast to a comparison of the contents of the boxes, which is relatively more resource intensive to compute. Resources may include processing time, memory usage, transmission-related costs, and so forth. Tracking the speed of facial movement based on changes in box location should detect problems with the quality of service that are severe enough to warrant corrective action. For example, if the difference in box locations shows that a face has been found in the wrong place by more than a de minimis amount, then the difference is an indication of a severe problem.

Embodiments in accordance with the present invention may provide a method to detect video frame delay in a video stream received by a telecommunications endpoint, the method including: locating reference features characteristic of content in the received video stream; calculating, by use of a processor, reduced reference features from the located reference features; receiving reduced reference features of a transmitted video stream, the transmitted video stream corresponding to the received video stream; calculating, by use of a processor, a received trajectory of the reduced reference features from the received video stream; calculating, by use of a processor, a transmitted trajectory of the reduced reference features from the transmitted video stream; and calculating, by use of a processor, video frame delay as a time shift between the received trajectory and the transmitted trajectory.

Embodiments in accordance with the present invention may provide a system to detect video frame delay in a video stream received by a telecommunications endpoint, the system including: a location module configured to locate reference features characteristic of content in the received video stream; a processor configured to calculate reduced reference features from the located reference features; a receiver configured to receive reduced reference features of a transmitted video stream, the transmitted video stream corresponding to the received video stream; a processor configured to calculate a distance between the reduced reference features in the received video stream and the reduced reference features of the transmitted video stream; a processor configured to calculate a received trajectory of the reduced reference features from the received video stream; a processor configured to calculate a transmitted trajectory of the reduced reference features from the transmitted video stream; and a processor configured to calculate video frame delay as a time shift between the received trajectory and the transmitted trajectory.

The preceding is a simplified summary of embodiments of the disclosure to provide an understanding of some aspects of the disclosure. This summary is neither an extensive nor an exhaustive overview of the disclosure and its various embodiments. It is intended neither to identify key or critical elements of the disclosure nor to delineate the scope of the disclosure, but to present selected concepts of the disclosure in a simplified form as an introduction to the more detailed description presented below. As will be appreciated, other embodiments of the disclosure are possible utilizing, alone or in combination, one or more of the features set forth above or described in detail below.

BRIEF DESCRIPTION OF THE DRAWINGS

The above and still further features and advantages of the present invention will become apparent upon consideration of the following detailed description of embodiments thereof, especially when taken in conjunction with the accompanying drawings, wherein like reference numerals in the various figures are utilized to designate like components, and wherein:

FIG. 1 is a block diagram depicting a communication system in accordance with an embodiment of the present invention;

FIG. 2 illustrates trajectory matching in accordance with an embodiment of the present invention;

FIG. 3 illustrates a method of calculating video frame delays, in accordance with an embodiment of the present invention;

FIG. 4 illustrates a graphical depiction of frame delay estimation in accordance with an embodiment of the present invention;

FIG. 5 illustrates a graphical depiction of a frame delay estimation technique in accordance with an embodiment of the present invention;

FIG. 6 illustrates a flow chart of a frame delay estimation process in accordance with an embodiment of the present invention;

FIG. 7 illustrates a graphical depiction of a temporal offset computation technique in accordance with an embodiment of the present invention; and

FIG. 8 illustrates a flow chart of a temporal offset computation process in accordance with an embodiment of the present invention.

The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include”, “including”, and “includes” mean including but not limited to. To facilitate understanding, like reference numerals have been used, where possible, to designate like elements common to the figures. Optional portions of the figures may be illustrated using dashed or dotted lines, unless the context of usage indicates otherwise.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

The disclosure will be illustrated below in conjunction with an exemplary communication system. Although well suited for use with, e.g., a system using a server(s) and/or database(s), the disclosure is not limited to use with any particular type of communication system or configuration of system elements. Those skilled in the art will recognize that the disclosed techniques may be used in any communication application in which it is desirable to utilize computationally-light methods to detect video degradations.

The exemplary systems and methods of this disclosure will also be described in relation to video conferencing software, modules, and associated video conferencing hardware. However, to avoid unnecessarily obscuring the present disclosure, the following description omits well-known structures, components, and devices that may be shown in block diagram form, are well known, or are otherwise summarized.

Embodiments in accordance with the present invention address the problem of detecting video frame delay degradation in real-time and in-service, to ensure end-to-end video quality in times of adverse network conditions by taking appropriate counter-measures. Such a quality of service (“QoS”) assurance mechanism requires light-weight video quality metrics that can be implemented with low computational and communication overheads. Embodiments herein describe a novel video quality metric for video conferencing-type applications that is accurate and light-weight for real-time operations.

Ensuring acceptable end-to-end video frame delay may require monitoring quality in real-time and in-service, and taking counter-measures in times of adverse network conditions. Such application-layer QoS assurance mechanisms may require light-weight video metrics that can be implemented with low computational and communication overheads.

Embodiments in accordance with the present invention provide a novel video metric for video conferencing-type applications that better reflects user opinion, at least as to quality, and is light-weight for real-time operations. Embodiments in accordance with the present invention may operate by exploiting the characteristics of the video content in such applications, i.e., few speakers with limited motion.

In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of embodiments or other examples described herein. In some instances, well-known methods, procedures, components, and circuits have not been described in detail, so as to not obscure the following description. Further, the examples disclosed are for exemplary purposes only, and other examples may be employed in lieu of, or in combination with, the examples disclosed. It should also be noted that the examples presented herein should not be construed as limiting of the scope of embodiments of the present invention, as other equally effective examples are possible and likely.

The terms “switch,” “server,” “contact center server,” or “contact center computer server” as used herein should be understood to include a Private Branch Exchange (“PBX”), an ACD, an enterprise switch, or other type of telecommunications system switch or server, as well as other types of processor-based communication control devices such as, but not limited to, media servers, computers, adjuncts, and the like.

As used herein, the term “module” refers generally to a logical sequence or association of steps, processes or components. For example, a software module may comprise a set of associated routines or subroutines within a computer program. Alternatively, a module may comprise a substantially self-contained hardware device. A module may also comprise a logical set of processes irrespective of any software or hardware implementation.

As used herein, the term “gateway” may generally comprise any device that sends and receives data between devices. For example, a gateway may comprise routers, switches, bridges, firewalls, other network elements, and the like, and any combination thereof.

As used herein, the term “transmitter” may generally comprise any device, circuit, or apparatus capable of transmitting an electrical signal.

The term “computer-readable medium” as used herein refers to any tangible storage and/or transmission medium that participates in storing and/or providing instructions to a processor for execution. Such a medium may take many forms, including but not limited to, non-volatile media, volatile media, and transmission media. Non-volatile media includes, for example, NVRAM, or magnetic or optical disks. Volatile media includes dynamic memory, such as main memory. Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, or any other magnetic medium, magneto-optical medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, solid state medium like a memory card, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read. A digital file attachment to e-mail or other self-contained information archive or set of archives is considered a distribution medium equivalent to a tangible storage medium. When the computer-readable media is configured as a database, it is to be understood that the database may be any type of database, such as relational, hierarchical, object-oriented, and/or the like. Accordingly, the disclosure is considered to include a tangible storage medium or distribution medium and prior art-recognized equivalents and successor media, in which the software implementations of the present disclosure are stored.

FIG. 1 depicts a communication system 100 which may be usable with an embodiment of the present disclosure. The communication system 100 may include an enterprise network 104 that is in communication, via a (typically untrusted or unsecure or public) communication network 108, with one or more external communication devices 112. The external communication devices 112 are generally referred to as “external” because they are either not under the direct control of the enterprise administering the enterprise network 104 or have a decreased level of trust with the enterprise network 104 as compared with communication devices 136 that are within the enterprise network 104. Exemplary types of external communication devices 112 include, without limitation, cellular phones, laptops, Personal Computers (PCs), Personal Digital Assistants (PDAs), digital phones, analog phones, and the like.

The communication network 108 may be packet-switched and/or circuit-switched. An exemplary communication network 108 includes, without limitation, a Wide Area Network (WAN), such as the Internet, a Public Switched Telephone Network (PSTN), a Plain Old Telephone Service (POTS) network, a cellular communications network, or combinations thereof. In one configuration, the communication network 108 is a public network supporting the TCP/IP suite of protocols.

The enterprise network 104 may include a boundary device 116 including a server table 120, a communication server 124 including a call feature sequencer 128 and a user table 132, one or more internal communication devices 136, an anchor point server 140, one or more application servers 144 which may be capable of providing one application 148 or a set of different applications 152, a number of other servers 156, and an enterprise database 160, all of which are interconnected by a (trusted or secure or private) Local Area Network (LAN) 164. Some or all of the functions depicted in FIG. 1 may be co-hosted and/or co-resident on a single server. The depiction of components in FIG. 1 is generally intended to be a logical depiction of the components of the system 100.

The LAN 164 can be secured from intrusion by untrusted parties by a gateway and/or firewall located between the LAN 164 and the communication network 108. In some embodiments, the boundary device 116 may include the functionality of the gateway and/or firewall. In some embodiments, a separate gateway or firewall may be provided between the boundary device 116 and the communication network 108.

The communications server 124 can include a Private Branch eXchange (PBX), an enterprise switch, an enterprise server, combinations thereof, or other type of telecommunications system switch or server. The communication server 124 is preferably configured to execute telecommunication functions such as the suite of Avaya Aura™ applications of Avaya, Inc., including Communication Manager™, Avaya Aura Communication Manager™, Avaya IP Office™, Communication Manager Branch™, Session Manager™, System Manager™, MultiVantage Express™, and combinations thereof. Embodiments herein may refer to communication server 124 generically as a “session manager” for ease of reference.

Although only a single communications server 124 is depicted in FIG. 1, two or more communications servers 124 may be provided in a single enterprise network 104 or across multiple separate LANs 164 owned and operated by a single enterprise, but separated by a communication network 108. In configurations where an enterprise or an enterprise network 104 includes two or more communications servers 124, each server 124 may comprise similar functionality, but may be provisioned for providing its features to only a subset of all enterprise users. In particular, a first communications server 124 may be authoritative for and service a first subset of enterprise users, whereas a second communications server 124 may be authoritative for and service a second subset of enterprise users, where the first and second subsets of users generally do not share a common user. This is one reason why the network boundary device 116 may be provided with a server table 120.

Additionally, multiple servers 124 can support a common user community. For example, in geo-redundant configurations and other applications where users aren't necessarily bound to a single application server, there may be a cluster of equivalent servers where a user can be serviced by any server in the cluster.

In accordance with at least some embodiments of the present invention, the mapping of user identities within a communication request does not necessarily have to occur at the network boundary device 116. For instance, the mapping between an authoritative server and a user may occur “behind” the network boundary device 116 within the enterprise network 104.

In some embodiments, network boundary device 116 is responsible for initially routing communications within the enterprise network 104 to the communications server 124 responsible for servicing a particular user involved in the communication. For example, if a first enterprise user is being called by an external communication device 112, then the network boundary device 116 may initially receive the inbound call, determine that the call is directed toward the first enterprise user, reference the server table 120 to identify the authoritative communications server 124 for the first enterprise user, and route the inbound call to the authoritative communications server 124. Likewise, communications between internal enterprise users (e.g., internal communication devices 136) may first be serviced by the originating user's authoritative communications server 124 during the origination phase of communications set-up.

After the origination phase is complete, the authoritative communications server 124 of the terminating (or called) user may be invoked to complete the termination phase of communications set-up. In some embodiments, the communications server 124 for the originating and terminating user may be the same, but it is not necessarily required that the server be the same. In situations where more than two enterprise users are involved in a communication session, authoritative communications servers 124 for each of the involved users may be employed without departing from the scope of the present invention. Additionally, the authoritative communications servers 124 for each user may be in the same enterprise network 104 or in different enterprise networks 104, which are owned by a common enterprise but are separated by the communication network 108.

Each communications server 124 includes a feature sequencer 128 and a user table 132. The user table 132 for a communications server 124 contains the communication preferences for each user for which it is authoritative. In particular, the user table 132 may be provisioned by users and/or by administrative personnel. The communications preferences for a particular user are referenced by the feature sequencer 128 to determine which, if any, features should be incorporated into a communication session for the user. The feature sequencer 128 can actually provide communication features directly into the communication session, or the feature sequencer 128 can determine an application sequence which will be invoked during set-up and used during the communication session.

In accordance with at least some embodiments, the feature sequencer 128 can determine an application sequence and cause one or more applications 148, 152 to be sequenced into a communication session. In particular, the feature sequencer 128 is configured to analyze a particular user's communication preferences and invoke the necessary applications to fulfill such preferences. Once an application sequence is determined by the feature sequencer 128, the communications server 124 passes the communication-establishing message to a first application in the application sequence, thereby allowing the first application to determine the parameters of the communication session, insert itself into the control and/or media stream of the communication session, and thereby bind itself to the communication session. Once the first application has inserted itself into the communication session, the first application either passes the communication-establishing message back to the feature sequencer 128 to identify the next application in the application sequence or passes the communication-establishing message directly to a second application in the application sequence. Alternatively, or in addition, the message may be redirected, rejected, or the like. Moreover, parties and/or media servers may be added to the call by an application. As can be appreciated, the process continues until all applications have been included in the communication session, and the process can be duplicated for each of the users involved in the communication session.

Although only two application servers 144 are depicted, one skilled in the art will appreciate that one, two, three, or more application servers 144 can be provided, and each server may be configured to provide one or more applications. The applications provided by a particular application server 144 may vary depending upon the capabilities of the server 144, and in the event that a particular application server 144 comprises a set of applications 152, one, some, or all of the applications in that set of applications 152 may be included in a particular application sequence. There is no requirement, however, that all applications in a set of applications 152 be included in an application sequence, and there is no requirement as to the order in which applications are included in the application sequence. Rather, the application sequence is usually determined based on a user's communication preferences, which can be found in the user table 132. Alternatively, or in addition, the applications that appear in a user's sequence vector and their order within that vector may be determined by a system administrator to satisfy business requirements.

Moreover, the application sequence can vary based on the media type(s) that are being used in the communication session. For instance, a user may have a first set of preferences for voice-based communications, a second set of preferences for video-based communications, and a third set of preferences for text-based communications. Additionally, a user may have preferences defining preferred media types and rules for converting communication sessions from one media type to another different media type. Still further, a user may have preferences defining the manner in which multi-media communications are established and conducted.

The applications included in a particular application sequence are generally included to accommodate the user's preferences. Applications may vary according to media-type, function, and the like. Exemplary types of applications include, without limitation, an EC-500 (extension to cellular) application, a call setup application, a voicemail application, an email application, a voice application, a video application, a text application, a conferencing application, a call recording application, a communication log service, a security application, an encryption application, a collaboration application, a whiteboard application, mobility applications, presence applications, media applications, messaging applications, bridging applications, and any other type of application that can supplement or enhance communications. Additionally, one, two, three, or more applications of a given type can be included in a single application sequence without departing from the scope of the present invention.

Embodiments in accordance with the present invention provide a novel video metric using facial trajectory analysis, which can be used for real-time detection of end-to-end video frame delays in video conferencing-type applications. A facial trajectory is produced by recording the locations of boxes that encapsulate faces found in a frame, the trajectory being recorded as a sequence of x and y coordinates with timestamps. Facial trajectory analysis may exploit certain characteristics of video content in such video conferencing-type applications, i.e., video conferences having few speakers with limited motion.

Embodiments in accordance with the present invention exploit the nature of video conferencing applications. A video conference includes a stream of frames captured by a source-side camera, which are sent to a destination to form a received stream of frames. Typically in a video conference there is at least one face in the video. Faces may be detected in a frame of the video, and detected again in at least one more frame of the video. A detected face may be characterized by a bounding box that encloses the detected face. The bounding box is associated with certain characteristics such as its location and size. The location may be, for example, coordinates of one predetermined corner of the bounding box (e.g., the lower left corner) relative to a coordinate system of the video display. The size may be, for example, the width and/or height of the bounding box relative to the coordinate system of the video display. In some embodiments, the size of the bounding box may be constant.
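
One plausible data representation for such a timestamped bounding box, offered only as a sketch: coordinates are normalized to the frame dimensions so that samples from sending and receiving sides with different resolutions remain comparable (normalization is discussed with FIG. 4 below). The class and field names are hypothetical, not part of the specification.

```python
from dataclasses import dataclass

@dataclass
class FaceBoxSample:
    """One timestamped, normalized bounding box (a reduced reference feature)."""
    t: float  # seconds relative to a common event such as call start
    x: float  # lower-left corner x / frame width  (0..1)
    y: float  # lower-left corner y / frame height (0..1)
    w: float  # box width  / frame width
    h: float  # box height / frame height

def normalize_box(px_box, frame_w, frame_h, t):
    """Convert a pixel-space (x, y, w, h) box into a resolution-independent sample."""
    x, y, w, h = px_box
    return FaceBoxSample(t, x / frame_w, y / frame_h, w / frame_w, h / frame_h)
```

A trajectory is then simply a time-ordered list of such samples.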

Embodiments in accordance with the present invention will be described in reference to facial detection. However, it should be understood that detection of other moving or movable objects or features that may be available in the video stream may be used to detect video frame delays.

Speakers in a video stream often move their heads and/or faces. For example, a speaker may turn to look around a room, or turn side-to-side to view a teleprompter, or move their mouth and/or jaw while speaking. Therefore, the location and/or size of a bounding box encompassing the speaker's face will change over time. A record of the change in the location and/or size of a bounding box over time is referred to as a trajectory. The trajectory may be a vector quantity that includes both a magnitude (i.e., speed) and a direction of change in the two-dimensional coordinates of a screen display.

Typically, facial detection is performed regularly, such as approximately once per second. The rate at which facial detection is performed may depend upon the speed of motion in the video. For example, video of a lecture may utilize a slower rate of facial detection than video of a debate or a musical performance. Movement of one or more detected faces in both the sent video stream and the received video stream may be computed by comparing bounding boxes of the detected faces among video frames of the respective video streams. A similarity between facial trajectories in sent and received video streams is a key property that can be used to infer delay information.

Embodiments in accordance with the present invention rely upon capturing sent and received streams of video frames, the video frames having been time-stamped with respect to a common event such as the signaling of call start at the source and destination, so that facial trajectory measurements are collected at substantially the same time between the sent and received sides. In one embodiment, one or both of the video streams may be obtained from within the video system itself, such as a fork of the media stream. In another embodiment, an external camera in view of a display monitor showing the sent or received video stream may be used to create a separate video stream used for analytic purposes. If one or more external cameras are used, the cameras need not grab exactly the same part of the frames on both the sending and receiving sides, but the size of the captured face with respect to the frame size should be similar between the sending and receiving sides, and the angle of camera placement should be similar. Accordingly, relative timestamps with respect to a common event may be obtained from within the video system as well as by using computers attached to external cameras for frame capture.

In some embodiments, clocks at the transmitting end and the receiving end of the media stream may not be strictly synchronized. Relative clock synchronization between the sent and received sides may be sufficiently provided in a number of ways, including using Network Time Protocol (“NTP”) or a process based on message exchanges between agents at the sent and received sides, as known in the art. In other embodiments, usage of strictly synchronized timestamps among the transmitting and receiving terminals may improve the analysis by an analytic engine, at least by reducing a frame delay estimation error. In other embodiments, compensating for a relative difference in signaling delay to the source and destination may improve the analysis. Relative clock drift may be corrected by periodic recalibrations.

Embodiments in accordance with the present invention rely upon an observation that, at least for some types of video media content, the facial trajectories of the sent and received video streams should be substantially the same except for an offset in time caused by delays in transporting the video stream from the transmitting end to the receiving end. The offset in time between the sent and received video streams is an estimate of the end-to-end frame delay. The sent and received streams of coordinates of the box locations with relative timestamps form the facial trajectories.

Embodiments in accordance with the present invention utilize techniques to synchronize two streams of box locations with timestamps pertaining to sent and received frames. Within a range of likely synchronization times, embodiments select a time value that tends to minimize a difference in characteristics of bounding boxes (e.g., x and y coordinates, or width and height) of the sent and received video streams across multiple video frames. The selected time value is considered an estimate of the video frame delay.
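
A minimal sketch of this time-shift selection, under the assumptions that each trajectory is reduced to (time, x-coordinate) samples and that the received trajectory may be linearly interpolated at shifted timestamps; the candidate-delay grid and the mean-squared-difference cost are illustrative choices, not necessarily those of the disclosed embodiments.

```python
import numpy as np

def estimate_frame_delay(sent_t, sent_x, recv_t, recv_x,
                         max_delay=2.0, step=0.01):
    """Return the delay d (seconds) minimizing the mean squared difference
    between the sent trajectory x(t) and the received trajectory x(t - d)."""
    sent_t, sent_x = np.asarray(sent_t), np.asarray(sent_x)
    best_d, best_cost = 0.0, float("inf")
    for d in np.arange(0.0, max_delay, step):
        # Evaluate the received trajectory, shifted back by d, at sent timestamps.
        shifted = np.interp(sent_t, np.asarray(recv_t) - d, recv_x)
        cost = float(np.mean((sent_x - shifted) ** 2))
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d
```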

Facial trajectory analysis is less computationally or resource-usage intensive compared to methods in the background art. Facial trajectory analysis may be usable with a variety of video codecs, and methods in accordance with an embodiment of the present invention are interoperable between different video codecs. For example, suppose a more powerful video codec is used on one end, e.g., a computationally lightweight mobile device is being used to view a conventional video call. Embodiments may provide an intermediate transcoding conversion step from a conference-quality video system to a handheld device or smartphone, in which some changes may be made to the video for the person with the handheld device to view the video. Transcoding may involve changing video parameters, which in turn may add additional delay to the network delay that the video traffic is experiencing.

Since facial trajectory analysis uses decoded frames in the computation, facial trajectory analysis takes into account decoder-specific error concealment capabilities. Facial trajectory analysis relies on detecting the trajectory of the location of a speaker's face in sent and received video frames, as determined by the characteristics of a box that encloses the detected face, and comparing the trajectories to identify end-to-end video frame delays. The box characteristics from the transmitting and/or receiving endpoints may then be transmitted to a predetermined analytic location for processing (e.g., to the transmitting side, to the receiving side, to an intermediate point such as a central server, etc.). An intermediate point may be, e.g., Application 148 in Application Server 144 of FIG. 1. Transmission of box characteristics incurs minimal overhead or burden because of the relatively small amount of data involved compared to the video stream itself. Facial trajectory may be computed at an intermediate server or at one of the endpoints using a stream of the calculated box characteristics and associated timestamps provided by the sending and receiving endpoints.

Detecting the location of the speaker's face in video frames, by a comparator such as a comparator module in a telecommunications endpoint or a comparator application executed by application server 144 of FIG. 1, and calculating one or more facial trajectories of the detected face, is a process that ordinarily should be completed quickly (i.e., in less than a second) and incur small computational overhead. A separate reliable channel such as a TCP/IP channel may be used to transmit the reduced reference features (i.e., box locations that frame identified faces) in-service (i.e., during the teleconference) to the predetermined analytic location.

Another advantage of embodiments in accordance with the present invention is that, when the embodiments are applied to video conferencing-type applications, face locations in consecutive frames are not likely to change drastically. Therefore, a relatively low-grade sampling rate (i.e., less than the full frame rate of the video signal, for instance at three to four frames per second) is sufficient for use by facial trajectory analysis.

Face detection is performed using processes known in the art, producing a bounding box that encapsulates each detected face found in the frame. Face detection techniques are described in U.S. patent application Ser. No. 13/588,637, which is incorporated herein by reference in its entirety.

Various processes may be used for comparing and/or matching trajectories of bounding boxes. For example, a process based upon dynamic time warping may be used to match segments of trajectories in order to produce an estimated delay. However, a standard dynamic time warping process known in the art, when applied to trajectories, would produce a resolution of one probing interval (e.g., about 350 milliseconds), which is too large when trying to estimate and/or correct for video frame delays. Probing interval in this context refers to the time between two consecutive face location measurements. In other embodiments, comparison of trajectories may include, for example, a correlation calculation between a trajectory over time derived from a transmitted video signal and a trajectory over time derived from a received video signal.

FIG. 2 illustrates example trajectories of face locations in the x-axis, in sent and received video streams. In this example, we assume there is only one face in the video frames, but a person of ordinary skill in the art will recognize how to extend the scenario of FIG. 2 to a case of multiple faces. In FIG. 2, the x-coordinate as a function of time of a box encapsulating a face in sent stream 201 is shown with solid lines. Similarly, the x-coordinate as a function of time of a box encapsulating a face in received stream 203 is shown with dashed lines. The times reported for the coordinates of both streams are relative times with respect to a common event such as call start time. Not shown in the figure are y-coordinates, which also may be used in the analysis.

In this example, if received stream 203 is plotted with an offset equal to the relative frame delay, then the plot of the received stream 203 would substantially overlap the plot of the sent stream 201. Embodiments in accordance with the present invention may try to find a value for the relative frame delay that tends to minimize the difference between the two plots when so shifted in time.

Embodiments in accordance with the present invention may match the trajectory associated with received stream 203 to the trajectory associated with sent stream 201. For example, a change in velocity in the sent stream 201 may occur at τ₁. A corresponding change in velocity in the received stream 203 may be observed at τ₂. A processing module which receives both the sent stream 201 and received stream 203, and which determines that the event occurring at τ₁ corresponds to the event occurring at τ₂, may then calculate a relative frame delay as being equal to τ₂−τ₁.
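
As a toy illustration of matching one such velocity-change event, assuming the event is simply the end of the first segment whose speed exceeds a threshold; the threshold, function name, and sample values are hypothetical.

```python
def first_motion_event(ts, xs, speed_threshold=0.05):
    """Timestamp of the end of the first segment whose speed exceeds the threshold."""
    for (t0, x0), (t1, x1) in zip(zip(ts, xs), zip(ts[1:], xs[1:])):
        if abs(x1 - x0) / (t1 - t0) > speed_threshold:
            return t1
    return None  # no detectable motion in this window

sent_ts, sent_xs = [0.00, 0.35, 0.70, 1.05], [0.50, 0.50, 0.58, 0.66]
recv_ts, recv_xs = [0.00, 0.35, 0.70, 1.05], [0.50, 0.50, 0.50, 0.58]
tau1 = first_motion_event(sent_ts, sent_xs)   # 0.70
tau2 = first_motion_event(recv_ts, recv_xs)   # 1.05
print(tau2 - tau1)                            # ~0.35 s estimated relative frame delay
```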

FIG. 3 illustrates a method 300 in accordance with an embodiment of the present invention. At step 301, reference features that are characteristic of content in the received video stream are located. At step 303, a processor is used to calculate reduced reference features from the located reference features. At step 305, reduced reference features of a transmitted video stream are received, the transmitted video stream corresponding to the received video stream. At step 307, a processor is used to calculate a received trajectory of the reduced reference features from the received video stream. At step 309, a processor is used to calculate a transmitted trajectory of the reduced reference features from the transmitted video stream. At step 311, a processor is used to calculate video frame delay as a time shift between the received trajectory and the transmitted trajectory.

A prototype system has been developed that performs a video capture and face detection process at periodic intervals, such as approximately every 350 milliseconds, on both the sent side and received side. At longer intervals, e.g., approximately 15-second intervals, trajectories are computed using the intervening face detection results. Ten to fifteen seconds has been found to be a sufficient time interval to obtain an adequate facial trajectory. A matching process is then used to match the received trajectories to the transmitted trajectories and calculate a frame delay during that 15-second interval.

In some circumstances there may be no motion by a speaker during a 15-second interval, in which case a delay estimate cannot be unambiguously updated from the estimate during the previous 15-second interval. However, once the speaker performs some detectable motion, such as turning their head or moving their mouth, an event is created in the sent trajectory which is sufficient to match to the received trajectory. If the video is unchanged for a longer period of time, the probability increases that there is a problem or interruption in the video stream which has caused either the sent or received video stream to freeze. Therefore, a secondary benefit of certain embodiments is an ability to detect problems causing frozen video.

If multiple trajectories are detected in a video stream (e.g., multiple faces and/or multiple characteristics derived from a single bounding box), embodiments in accordance with the present invention may estimate the video frame delay by attempting to maximize a total correlation among all trajectories. In some embodiments, characteristics may be given unequal weighting (e.g., bounding box location may be given more weight than bounding box width or height; bounding boxes in the center of the video stream may be given more weight than bounding boxes at the periphery of the video stream, etc.). If the number of faces changes, e.g., in a video feed of a panel discussion that switches back and forth between a wide view of the entire panel and a close-up view of a single speaker, embodiments may use frames with either the close-up views or the wide views. In this situation, face size may be used to choose which frames to use.

Embodiments in accordance with the present invention match faces found in received frames to faces found in sent frames by selecting pairs of coordinates with minimum distances between them. Face detection software, due to inherent limitations of known processes, may find more or fewer faces than are actually present in the video. If the number of speakers is known to be k, then up to k such distance values are selected, pertaining to each face found in the frame. In cases where a face is not detected in a sent frame, a trajectory may be interrupted.
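
A sketch of one such pairing step, assuming each detected face is reduced to the (x, y) center of its bounding box and that pairs are chosen greedily by minimum Euclidean distance; a real implementation might use an optimal assignment instead, and all names here are hypothetical.

```python
import math

def match_faces(sent_centers, recv_centers, k):
    """Greedily pair sent and received face centers by minimum distance,
    returning up to k (sent_index, recv_index) pairs."""
    candidates = sorted(
        (math.dist(s, r), i, j)
        for i, s in enumerate(sent_centers)
        for j, r in enumerate(recv_centers)
    )
    pairs, used_s, used_r = [], set(), set()
    for _, i, j in candidates:
        if i not in used_s and j not in used_r:
            pairs.append((i, j))
            used_s.add(i)
            used_r.add(j)
            if len(pairs) == k:
                break
    return pairs

# Two known speakers; a spurious third detection on the received side is ignored.
print(match_faces([(0.2, 0.5), (0.7, 0.5)],
                  [(0.22, 0.49), (0.68, 0.52), (0.9, 0.9)], k=2))
# [(0, 0), (1, 1)]
```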

FIG. 4 illustrates an exemplary graphical depiction 400 of a sent trajectory 402 and a received trajectory 404 of an example video stream. The x-axis represents time, and the y-axis represents normalized position in a predetermined coordinate. As illustrated, the coordinate is the “X” (i.e., horizontal) direction, but could also be the “Y” (i.e., vertical) direction, or a function of one or both (e.g., the Euclidean distance $\sqrt{X^2 + Y^2}$). The frame delay estimation process may include multiple instances of graphical depiction 400, each depicting a trajectory of a different metric (e.g., separate trajectories in the x-axis and y-axis). Normalization substantially compensates for differences in screen size and resolution between the sending and receiving sides. The y-axis may also represent other metrics such as bounding box width and/or height. Trajectory 402 represents the normalized position or metric at one side of the media stream (e.g., the sending side), and trajectory 404 represents the normalized position or metric at the other side of the media stream (e.g., the receiving side).

The prototype relies on a sample of the frames, selected at periodic intervals at both the sending and receiving sides and processed to find face locations. Probing interval refers to the time between two consecutive face location measurements at the sending or the receiving side. The x-axis of depiction 400 shows probing intervals 406 and the resulting face location measurements extracted from selected frames as filled circles at both the sending and receiving sides. Note that during periodic sampling at the sending and receiving sides, different frames will be selected. Hence the trajectories extracted at the sending and receiving sides will not be identical, but similar. Facial trajectories may be extracted from the sent and received media streams, and facial trajectories or other features so extracted from the sent and received media streams may be time-stamped with respect to an event that is common to both the sent and received media streams (e.g., a movement, a change in camera shot, etc.).

FIG. 5 graphically illustrates a frame delay estimation process 500 in accordance with an embodiment of the present invention. Process 500 may begin when “N” metrics such as bounding box coordinates have been detected from the transmitted video feed, along with associated timestamps. Furthermore, assume that “M” metrics have been detected from the received video feed, along with associated timestamps. Ordinarily M=N, but they are not necessarily equal when the video is noisy or the video includes a cutaway to a different scene, perspective, or so forth. Let the transmitted metrics be represented as a set of 3-tuple vectors $(X_i^S, Y_i^S, T_i^S)$ for $1 \leq i \leq N$, and let the received metrics be represented as a set of 3-tuple vectors $(X_j^R, Y_j^R, T_j^R)$ for $1 \leq j \leq M$.

Video frame delay estimation may be calculated from trajectories 502 and 504 by a processor of system 100. In particular, the frame delay estimate may be calculated by usage of curve-matching processes such as a dynamic time warping (“DTW”) process known in the art. The frame delay estimate may be produced at time intervals that are sufficiently frequent with respect to variations in the sources of video frame delays. For such purposes, once every 15-30 seconds may be considered sufficiently real-time.

Process 500 is illustrated in further detail in FIG. 6. Process 500 begins at step 601, during which, for sent and received segments, a processor such as server 144 computes speed in pixels per second, normalized to the frame size. A segment in this context refers to the part of a trajectory between two consecutive face detection measurements.
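
A sketch of step 601, under the assumption that face locations are already normalized to the frame size (so “pixels per second normalized to the frame size” becomes frame-relative units per second); the function name and data layout are illustrative.

```python
import math

def segment_speeds(samples):
    """Speed of each trajectory segment, in normalized units per second.

    samples: time-ordered (t, x, y) tuples with x and y already normalized
    to the frame dimensions, making speeds comparable across resolutions.
    """
    speeds = []
    for (t0, x0, y0), (t1, x1, y1) in zip(samples, samples[1:]):
        dist = math.hypot(x1 - x0, y1 - y0)
        speeds.append(dist / (t1 - t0))
    return speeds
```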

Process 500 transitions to step 603, at which a processor such as server 144 matches sent-side and received-side segments. A process such as DTW may be used in the matching, using trajectory speed in the segment as a “cost” for purposes of computing DTW.
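
A textbook dynamic time warping sketch for this matching step, using the absolute difference of per-segment speeds as the cost, as suggested above; this is one plausible reading of the step, not necessarily the exact process used by the embodiments.

```python
def dtw_match(sent_speeds, recv_speeds):
    """Match sent and received segments with dynamic time warping.
    Cost is the absolute per-segment speed difference; returns the
    warping path as (sent_segment, recv_segment) index pairs."""
    n, m = len(sent_speeds), len(recv_speeds)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(sent_speeds[i - 1] - recv_speeds[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    # Backtrack from (n, m) to recover the matched segment pairs.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((D[i - 1][j - 1], i - 1, j - 1),
                      (D[i - 1][j], i - 1, j),
                      (D[i][j - 1], i, j - 1))
    return list(reversed(path))
```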

Process 500 transitions to step 605, at which a processor such as server 144 computes the temporal offset within matched segments.

FIG. 7 graphically illustrates a temporal offset computation process 700 in accordance with an embodiment of the present invention. Process 700 as illustrated assumes a temporal offset 706 (denoted as “O”) and a spatial offset 708 (denoted as “A”).

Process 700 is illustrated in further detail in FIG. 8. Process 700 begins at step 801, during which, for each pair of sent and received segments matched by the DTW algorithm, a smoothing function such as an average of adjacent face location measurements may be applied. For example, smoothing calculations may be carried out in accordance with Equations (1) and (2) for the sending and receiving sides, respectively:

$\begin{matrix}{X_{i}^{S} = \left( \frac{x_{i}^{S} + x_{i + 1}^{S}}{2} \right)} & (1) \\{X_{j}^{R} = \left( \frac{x_{j}^{R} + x_{j + 1}^{R}}{2} \right)} & (2)\end{matrix}$

Process 700 then transitions to step 803, at which the temporal offset is determined for each pair of sent and received segments matched by the DTW algorithm. Assuming the temporal offset and the spatial offset do not substantially change within the short time period over which the frame delay metric is estimated, the temporal offset for a pair of matched segments may be calculated in accordance with Equation (3), where $v_i$ is the reference facial speed at the $i^{\text{th}}$ matched segment:

$\begin{matrix}{O = \left( \frac{X_{i}^{S} - X_{j}^{R} - A}{v_{i}} \right)} & (3)\end{matrix}$

Process 700 then transitions to step 805, where the overall temporal offset may be calculated as, for example, a median or a mean of the temporal offsets computed for each matched pair of segments. Other statistical measures known in the art may also be used.
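
Combining steps 801 through 805, a sketch that applies the smoothing of Equations (1) and (2), the per-pair offset of Equation (3), and a median over all matched pairs; the matched pairs, the spatial offset A, and the reference speeds $v_i$ are assumed to be supplied by the preceding steps, and all names are illustrative.

```python
from statistics import median

def overall_temporal_offset(pairs, sent_x, recv_x, speeds, A=0.0):
    """Median temporal offset over DTW-matched segment pairs.

    pairs:   (i, j) segment index pairs from the DTW matching step
    sent_x:  sent-side face x-measurements (segment i spans indices i, i+1)
    recv_x:  received-side face x-measurements
    speeds:  reference facial speed v_i for each sent segment
    A:       assumed spatial offset between the trajectories
    """
    offsets = []
    for i, j in pairs:
        xs = (sent_x[i] + sent_x[i + 1]) / 2           # Equation (1)
        xr = (recv_x[j] + recv_x[j + 1]) / 2           # Equation (2)
        if speeds[i] > 0:                              # skip motionless segments
            offsets.append((xs - xr - A) / speeds[i])  # Equation (3)
    return median(offsets) if offsets else None
```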

Embodiments in accordance with the present invention may employ various patterns for selecting frames periodically from a video stream and processing them for face locations. One example pattern is to select a frame at equal probing intervals, such as one frame every 350 ms. Alternatively, a probing pattern may alternate between a short probing interval (such as selecting a frame in the next 350 ms) and a long probing interval (such as selecting a frame in the next 500 ms). The goal of the probing pattern is to sample only a few of the frames in such a way that the sent and received trajectories extracted using these few frames closely resemble those trajectories that would be extracted if all the frames were used.
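
A sketch of the alternating probing pattern described above; the interval values follow the 350 ms/500 ms example, while the generator name and horizon are illustrative.

```python
import itertools

def probe_times(start=0.0, intervals=(0.35, 0.50), horizon=15.0):
    """Yield frame-selection times alternating between a short and a long
    probing interval, up to the given horizon (seconds)."""
    t = start
    for dt in itertools.cycle(intervals):
        if t > start + horizon:
            return
        yield t
        t += dt

print([round(t, 2) for t in probe_times(horizon=2.0)])
# [0.0, 0.35, 0.85, 1.2, 1.7]
```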

The disclosed methods may be readily implemented in software, such as by using object or object-oriented software development environments that provide portable source code that can be used on a variety of computer or workstation platforms. Alternatively, the disclosed system may be implemented partially or fully in hardware, such as by using standard logic circuits or VLSI design. Whether software or hardware is used to implement the systems in accordance with various embodiments of the present invention may be dependent on various considerations, such as the speed or efficiency requirements of the system, the particular function, and the particular software or hardware systems being utilized.

While the foregoing is directed to embodiments of the present invention, other and further embodiments of the present invention may be devised without departing from the basic scope thereof. It is understood that various embodiments described herein may be utilized in combination with any other embodiment described, without departing from the scope contained herein. Further, the foregoing description is not intended to be exhaustive or to limit the invention to the precise form disclosed. Modifications and variations are possible in light of the above teachings or may be acquired from practice of the invention. Certain exemplary embodiments may be identified by use of an open-ended list that includes wording to indicate that the list items are representative of the embodiments and that the list is not intended to represent a closed list exclusive of further embodiments. Such wording may include “e.g.,” “etc.,” “such as,” “for example,” “and so forth,” “and the like,” etc., and other wording as will be apparent from the surrounding context.

No element, act, or instruction used in the description of the present application should be construed as critical or essential to the invention unless explicitly described as such. Also, as used herein, the article “a” is intended to include one or more items. Where only one item is intended, the term “one” or similar language is used. Further, the term “any of” followed by a listing of a plurality of items and/or a plurality of categories of items, as used herein, is intended to include “any of,” “any combination of,” “any multiple of,” and/or “any combination of multiples of” the items and/or the categories of items, individually or in conjunction with other items and/or other categories of items.

Moreover, the claims should not be read as limited to the described order or elements unless stated to that effect. In addition, use of the term “means” in any claim is intended to invoke 35 U.S.C. §112, ¶6, and any claim without the word “means” is not so intended.

What is claimed is:
1. A method to calculate video frame delay in a video stream received by a telecommunications endpoint, the method comprising: locating, by a processor, reference features that are characteristic of content in the received video stream; calculating, by the processor, reduced reference features from the located reference features; receiving, by the processor, reduced reference features of a transmitted video stream, the transmitted video stream corresponding to the received video stream; calculating, by the processor, a received trajectory of the reduced reference features from the received video stream; calculating, by the processor, a transmitted trajectory of the reduced reference features from the transmitted video stream; and calculating, by the processor, frame delay as a time shift between the received trajectory and the transmitted trajectory.
2. The method of claim 1, wherein the reference features that are characteristic of content comprise faces.
3. The method of claim 1, wherein the reduced reference features from the received video stream and the transmitted video stream comprise locations of the reference features.
4. The method of claim 1, wherein the reduced reference features from the received video stream and the transmitted video stream comprise a location of rectangular areas surrounding the reference features.
5. The method of claim 1, wherein the reduced reference features from the received video stream are calculated at less than a full frame rate of the received video stream.
6. The method of claim 1, wherein the reduced reference features from the received video stream are calculated at no more than five frames per second.
7. The method of claim 1, wherein the reduced reference features of the transmitted video stream are received via a communication channel that is separate from a communication channel used to transport the transmitted video stream.
8. The method of claim 1, further comprising: receiving, by the processor, a frame of the transmitted video stream to the telecommunications endpoint, once per second, via a reliable channel; and calculating, by the processor, reduced reference features from the frame.
9. The method of claim 1, wherein the received video stream originates from a remotely-located camera accessible via a communication path.
10. The method of claim 1, further comprising: detecting, by the processor, an absence of reference features from the received video stream.
11. A system to calculate video frame delay in a video stream received by a telecommunications endpoint, the system comprising: a computer-readable storage medium, storing executable instructions; and a processor coupled to the computer-readable storage medium, the processor, when executing the executable instructions: locates reference features that are characteristic of content in the received video stream; calculates reduced reference features from the located reference features; receives reduced reference features of a transmitted video stream, the transmitted video stream corresponding to the received video stream; calculates a distance between the reduced reference features in the received video stream and the reduced reference features of the transmitted video stream; calculates a received trajectory of the reduced reference features from the received video stream; calculates a transmitted trajectory of the reduced reference features from the transmitted video stream; and calculates video frame delay as a time shift between the received trajectory and the transmitted trajectory.
12. The system of claim 11, wherein the reference features that are characteristic of content comprise faces.
13. The system of claim 11, wherein the reduced reference features from the received video stream and the transmitted video stream comprise a location of rectangular areas surrounding the reference features.
14. The system of claim 11, wherein the reduced reference features from the received video stream are calculated at less than a full frame rate of the received video stream.
15. The system of claim 11, wherein the reduced reference features from the received video stream are calculated at no more than five frames per second.
16. The system of claim 11, wherein the reduced reference features of the transmitted video stream are received via a communication channel that is separate from a communication channel used to transport the transmitted video stream.
17. The system of claim 11, wherein the received video stream originates from a remotely-located camera accessible via a communication path.
18. The system of claim 11, wherein the processor, when executing the executable instructions: detects an absence of reference features from the received video stream.